This workshop aims to give you a taste of bioinformatics by utalising some basic tools and analyses to explore and interpret both genetic and epigenetic data.
We’re going to be exploring some of the topics that you’ve covered in genetics and epigenetics over the last couple of days. As such this workshop consists of two parts:
- Population Genetics
- how do we explore genetic variation between populations?
- what does this look like?
- Epigenetics
- can we use epigenetics in a similar way as genetic variation to distinguish cells/tissues?
As always please don’t hesitate to ask your questions and prompt discussion!
1. Population Genetics
Note: The majority of code and data for this section is borrowed/modified from the course GENE360 maintained by Associate Professor Mik Black. You can find the full resource here: https://github.com/mikblack/GENE360-PopDiv
Overview
The objective of this module is to give you some familiarity with the types of data and tools that are commonly used for genetic studies of population diversity.
The data we will be using are publicly available high-throughput sequencing data from the 1000 Genomes Project. The main web page for the 1000 Genomes Project is:
The following page provides background information about the 1000 Genomes Project (and is worth reading, particularly the information about the populations being used in the study):
http://www.1000genomes.org/about
We will be using the R and RStudio software applications to analyze the data.
R: https://cran.rstudio.com/ RStudio: https://www.rstudio.com/product/rstudio/download/
Getting started - loading the data
The first data set we will look at contains genetic data on just a few loci (specific points in the genome) across the various populations being studied. Looking at this small data set will give you a feel for what we mean by “genotype data” for a particular locus.
Open RStudio and read in the first data set using the following command:
## Read in the SNP data
snpData = read.table('data/GENE360snpData.txt',sep='\t',header=T)Looking at SNP frequencies
The other nine columns in the data set relate to specific single nucleotide polymorphisms (SNPs) in the genome - we have the genotype data for each SNP for every individual in the data set.
We can use the “table” command again to summarize the genotype information for each SNP:
## Make a genotype frequency table for the first SNP
table(snpData$rs3826656)| AA | AG | GG |
|---|---|---|
| 1112 | 952 | 440 |
We can also calculate the proportions associated with each genotype:
## Calculate proportions
prop.table(table(snpData$rs3826656))| AA | AG | GG |
|---|---|---|
| 0.4440895 | 0.3801917 | 0.1757188 |
and examine differences in genotype frequencies across populations:
## Create contingeny table - genotypes across populations
table(snpData$Population, snpData$rs3826656)| / | AA | AG | GG |
|---|---|---|---|
| AFR | 445 | 199 | 17 |
| AMR | 225 | 109 | 13 |
| EAS | 47 | 224 | 233 |
| EUR | 312 | 161 | 30 |
| SAS | 83 | 259 | 147 |
## And calculate proportions (rounded)
round(prop.table(table(snpData$Population, snpData$rs3826656),1),2)| / | AA | AG | GG |
|---|---|---|---|
| AFR | 0.67 | 0.30 | 0.03 |
| AMR | 0.65 | 0.31 | 0.04 |
| EAS | 0.09 | 0.44 | 0.46 |
| EUR | 0.62 | 0.32 | 0.06 |
| SAS | 0.17 | 0.53 | 0.30 |
These results can be plotted as a bar plot:
snpFreqs = t(prop.table(table(snpData$Population, snpData$rs3826656), 1))
barplot(snpFreqs, beside=TRUE, legend.text=TRUE, ylim=c(0,1))These analyses can be repeated for different SNPs by changing the SNP ID (e.g., rs3826656) in the above commands.
Population diversity
The second data set that we will look at can be used to investigate population (genetic) diversity. These SNPs were chosen because they are “ancestry informative” - variation in their genotype frequencies tends to be associated with differences between population groups.
## Load ancestry data
snpAns = read.table('data/GENE360snpAncestry.txt', header=T, sep='\t')
## Check dimensionality of data
dim(snpAns)## [1] 2504 2305
Now we have a lot more data: 2504 individuals, and 2302 SNP genotypes for each. Note that there are 2305 columns in the data set, but the first three relate to non-SNP data.
To examine population diversity, we need to do two things:
- create a data object of ONLY the SNP genotype data (i.e., remove the first three columns).
- convert the genotypes to allele counts (e.g., TT, AT, AA to 0, 1, 2 - count the number of A’s).
Step 1:
## Create object containing only the SNP data - remove the first three columns
snpAnsDat = snpAns[,-c(1,2,3)]Step 2:
## Load a custom function to convert to allele counts
source('alleleCounts.R')
## Apply the function to the genotype data, one column (SNP) at a time
## This will take 30 seconds or so...
snpAnsCount = apply(snpAnsDat, 2, alleleCounts)Now we have a data set of just the SNP data, with genotypes converted to allele counts.
So, what are we going to do with this new data set?
Dimension reduction (principal components analysis)
In terms of examining population diversity, we have 2302 dimensions of data available - one dimension for each SNP.
Rather than trying to comprehend this huge amount of data in 2302-dimensional space, genetics researchers often use a statistical tool called Principal Components Analysis (PCA) to reduce the dimensionality of the data.
The idea is to find the most important variation in the data, and examine the samples in terms of that variation, ignoring the rest. In practice, this works fairly well, because genetic differences between populations provide a strong (and relatively consistent) source of variation across genomic loci (i.e., SNPs). Rather than looking at 2302 dimensions of data, we end up looking at variation across just 2 or 3 dimensions - each dimension is defined by a combination of SNPs which vary in a similar way across the individuals in the study.
We can plot the principal components and colour the points based on the population each sample belongs to.
# use flashpcaR package for fast PCA analysis
library(flashpcaR)
# load scales for transparent points
library(scales)
# perform PCA
r <- flashpca(snpAnsCount, do_loadings=TRUE, verbose=TRUE, stand="binom", ndim=3)PCA plots
From the plots we can see that samples from the same population tend to cluster together, and that the first three principal components do a reasonable job of capturing the genetic diversity between the populations.
With the scatterplot3d package, you can plot the first three principal components at once (i.e., combining the information from the three scatterplots above). This shows that the European (EUR), East Asian (EAS) and South Asian (SAS) super-populations are relatively homogeneous, while the Ad-Mixed American (AMR) and African (AFR) super-populations exhibit greater variation, suggesting admixture within these groups.
library(scatterplot3d)
scatterplot3d(r$vectors[,1], r$vectors[,2], r$vectors[,3], color=as.numeric(snpAns$Population), pch=16,
cex.symbols=0.5, xlab="PC1", ylab="PC2", zlab="PC3")
legend(-3.5, 2.5, legend = c('AFR', 'SAS', 'EAS', 'AMR', 'EUR'), col = c('black', 'green', 'cyan', 'red', 'blue'), pch = 16, bty = 'n')Interactive 3D plot
Phylogenetic trees
Another way to visualise genetic similarity is via “phylogenetic trees”, also known more generally as “dendrograms”.
These tree diagrams group items together based on similarity scores. In this setting our “items” are the populations, and the scores are calculated based on the genetic similarities between each pair of populations.
To calculate similarities, we need to create a set of “average” genotypes for each population. One way to do this is to calculate the frequency of the major homozygote for each SNP in each population.
## Load custom function for calculating major homozygote frequency:
source('calcMajorFreq.R')
## Apply this function to the SNP count data
## Takes a few seconds
popFreqs = calcMajorFreq(snpAnsCount, snpAns$Population)# Make a cluster tree based on these frequencies
plot(hclust(dist(popFreqs)),hang=-1)The way in which the “leaves” of the tree cluster, reflects the similarity between the items analysed. In this case the populations are grouped based on their genetic similarity, as measured by these particular loci. Here it appears that the AMR (Ad-Mixed American), and SAS (South Asian) super-populations are most genetically similar, then EUR (European), followed by EAS (East Asian). The AFR (African) super-population is the most genetically dissimilar relative to the others. It is likely that these groupings reflect (to some degree) migration patterns and shared ancestry experienced by these populations.
2. Epigenetics - CpGs and DNA methylation
Overview
One epigenetic mechanism for which there is a lot of data available is DNA methylation. You can think of this as an underline that occurs under two specific letters, in DNA language this is CG, but you could think of it as words with a “GN” next to each other e.g magnetic, gnat, gnome, significant. Remind them about context – different tissues etc.
- Is it worth creating a bar chart similar to the genotype one for a given CpG so they get a feel for it (100% methylated, 50% or 0%) in a given cell? We can rig it obviously.
# create some tissue methylation data
tissue_data <- data.frame(CG_1 = c(100, 50, 0), CG_2 = c(50, 50, 100), CG_3 = c(0, 100, 50))
rownames(tissue_data) <- c("blood", "heart", "brain")
tissue_data <- as.matrix(tissue_data)
barplot(tissue_data, beside=TRUE, legend.text=TRUE, ylim=c(0,100), col = c('cadetblue', 'darkred', 'grey65'), ylab = '% methylation',
args.legend = list(bty = 'n'))2. load data
The data for this workshop has been provided for you. There are 2 main data files:
- annotation for the Illumina 450k methylation array
- methylation data (beta values) for 5 whole blood and 5 buccal samples
Load this into R:
library(minfi)
# meth.anno <- readRDS('data/Ilmn450K_anno.RDS')
betas <- readRDS('data/blood_buccal.RDS')3. set up samples/data and basic QC explore
We’ll create an object called tissues which provides identification of our blood and buccal samples:
# define samples
tissues <- c(rep(0, 5), rep(1, 5))
tissues <- as.factor(tissues)
levels(tissues) <- c('Blood', 'Buccal')Now we’ll explore the distribution of methylation values (beta values) by creating a density plot. A methylation distribution for a given individual follows a bimodal distribution, with peaks around 0 (unmethylated) and 1 (methylated).
Create a density plot for all samples, what do you see?
# create a density plot
# densityPlot(betas, sampGroups = tissues)- Tree of tissues all CpGs
Another way of exploring a data set is by cluster analysis.
Here we perform hierarchical clustering on the 10 samples using all available data:
# Hierarchical Clustering
d <- t(betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on all probes", labels = tissues, xlab = '', sub="")What do you conclude from the above?
- Tree of tissues with tissue biomarker CpGs
Run dmpFinder:
# minfi dmpfinder
results <- dmpFinder(betas, pheno = tissues, type = "categorical", qCutoff = 1)
results <- na.omit(results)Now extract these sites into an object called top.results:
top.results <- results[results$qval <= 0.00005,]We previously performed clustering on all CpG sites, now we’re going to cluster on those that pass our defined threshold.
First extract a smaller matrix of all significant CpG sites:
# subset the beta matrix by the top markers and cluster
sig.betas <- betas[rownames(betas) %in% rownames(top.results), ]Now perform the clustering:
# Hierarchical Clustering
d <- t(sig.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top tissue discrimination CpG sites", labels = tissues,
xlab = "", sub = "")What do you see?
- Perhaps here introduce the idea of blood cell types. So epigenetics – identify a “new”, previously unrecognised cell type or inflammatory response related to disease?
## Read in the methylation data for 10 different cell populations
blood_data <- read.csv('data/publicbeta_data.csv', head = T, row.names = 1, as.is = F)
colnames(blood_data) <- gsub('\\.', '', colnames(blood_data))We’ll now create another tree:
# Hierarchical Clustering
d <- t(blood_data)
rownames(d) <- colnames(blood_data)
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top cell type discrimination CpG sites",
labels = colnames(blood_data), xlab = "", sub = "")Another PCA:
library(FactoMineR)
library(formatR)
blood_data2 <- data.frame(cells = as.factor(colnames(blood_data)), t(blood_data))
# clean cell type groups
blood_data2$cells <- gsub('_.*', '', blood_data2$cells)
blood_data2$cells <- as.factor(blood_data2$cells)
cell_pca <- PCA(blood_data2, quali.sup = 1, graph = F)
plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(1,2))plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(1,3))plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(2,3))plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(3,4))Nice 3D view:
If time and can find an appropriate dataset show an analysis of environmental exposure, and/or strong disease phenotype
Could we have an example/test run where say we set them a case study – Miles idea forensics – can they talk us through identifying a cell type at a crime scene
Think of potential discussion points throughout.